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ABSTRACT 

Sparse  representation  models  a  signal  as  a  linear 
combination  of  a  small  number  of  dictionary  atoms.  As  a  generative 
model,  it  requires  the  dictionary  to  be  highly  redundant 
in  order  to  ensure  both  a  stable  high  sparsity  level  and  a 
low  reconstruction  error  for  the  signal.  However,  in  practice, 
this  requirement  is  usually  impaired  by  the  lack  of  labelled 
training  samples.  Fortunately,  previous  research  has  shown  that 
the  requirement  for  a  redundant  dictionary  can  be  less  rigorous 
if  simultaneous  sparse  approximation  is  employed,  which  can  be 
carried  out  by  enforcing  various  structured  sparsity  constraints 
on  the  sparse  codes  of  the  neighboring  pixels.  In  addition, 
numerous  works  have  shown  that  applying  a  variety  of  dictionary 
learning  methods  for  the  sparse  representation  model  can  also  improve 
the  classification  performance.  In  this  paper,  we  highlight 
the  task-driven  dictionary  learning  algorithm,  which  is  a  general 
framework  for  the  supervised  dictionary  learning  method.  We 
propose  to  enforce  structured  sparsity  priors  on  the  task-driven 
dictionary  learning  method  in  order  to  improve  the  performance 
of  the  hyperspectral  classification.  Our  approach  is  able  to 
benefit  from  both  the  advantages  of  the  simultaneous  sparse 
representation  and  those  of  the  supervised  dictionary  learning. 

We  enforce  two  different  structured  sparsity  priors,  the  joint 
and  Laplacian  sparsity,  on  the  task-driven  dictionary  learning 
method  and  provide  the  details  of  the  corresponding  optimization 
algorithms.  Experiments  on  numerous  popular  hyperspectral 
images  demonstrate  that  the  classification  performance  of  our 
approach  is  superior  to  sparse  representation  classifier  with 
structured  priors  or  the  task-driven  dictionary  learning  method. 
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Task-Driven  Dictionary  Learning  for  Hyperspectral 
Image  Classification  with  Structured  Sparsity 

Constraints 

Xiaoxia  Sun,  Nasser  M.  Nasrabadi,  Fellow,  IEEE,  and  Trac  D.  Tran,  Fellow,  IEEE 


Abstract — Sparse  representation  models  a  signal  as  a  linear 
combination  of  a  small  number  of  dictionary  atoms.  As  a  gener¬ 
ative  model,  it  requires  the  dictionary  to  be  highly  redundant 
in  order  to  ensure  both  a  stable  high  sparsity  level  and  a 
low  reconstruction  error  for  the  signal.  However,  in  practice, 
this  requirement  is  usually  impaired  by  the  lack  of  labelled 
training  samples.  Fortunately,  previous  research  has  shown  that 
the  requirement  for  a  redundant  dictionary  can  be  less  rigorous 
if  simultaneous  sparse  approximation  is  employed,  which  can  be 
carried  out  by  enforcing  various  structured  sparsity  constraints 
on  the  sparse  codes  of  the  neighboring  pixels.  In  addition, 
numerous  works  have  shown  that  applying  a  variety  of  dictionary 
learning  methods  for  the  sparse  representation  model  can  also  im¬ 
prove  the  classification  performance.  In  this  paper,  we  highlight 
the  task-driven  dictionary  learning  algorithm,  which  is  a  general 
framework  for  the  supervised  dictionary  learning  method.  We 
propose  to  enforce  structured  sparsity  priors  on  the  task-driven 
dictionary  learning  method  in  order  to  improve  the  performance 
of  the  hyperspectral  classification.  Our  approach  is  able  to 
benefit  from  both  the  advantages  of  the  simultaneous  sparse 
representation  and  those  of  the  supervised  dictionary  learning. 
We  enforce  two  different  structured  sparsity  priors,  the  joint 
and  Laplacian  sparsity,  on  the  task-driven  dictionary  learning 
method  and  provide  the  details  of  the  corresponding  optimization 
algorithms.  Experiments  on  numerous  popular  hyperspectral 
images  demonstrate  that  the  classification  performance  of  our 
approach  is  superior  to  sparse  representation  classifier  with 
structured  priors  or  the  task-driven  dictionary  learning  method. 

Index  Terms — Sparse  representation,  supervised  dictionary 
learning,  task-driven  dictionary  learning,  joint  sparsity,  Lapla¬ 
cian  sparsity,  hyperspectral  imagery  classification 


I.  Introduction 

CLASSIFICATION  on  Hyperspectral  Imagery  (HSI)  is 
becoming  increasingly  popular  in  remote  sensing.  No¬ 
table  applications  include  military  aerial  surveillance  Q- 
mineral  identification  and  material  defects  detection  0. 
However,  numerous  difficulties  impede  the  improvement  of 
HSI  classification  performance.  For  instance,  the  high  dimen¬ 
sionality  of  HSI  pixels  introduce  the  problem  of  the  ‘curse 
of  dimensionality’  0.  and  the  classifier  is  always  confronted 
with  the  overfitting  problem  due  to  the  small  number  of 

X.Sun  and  T.  D.  Tran  are  with  the  Department  of  Electrical  and  Computer 
Engineering,  The  Johns  Hopkins  University,  Baltimore,  MD  21218  USA  (e- 
mail:  xsun9@jhu.edu;  trac@jhu.edu).  This  work  has  been  partially  supported 
by  NSF  under  Grants  CCF-1 1 17545,  ARO  under  Grants  60219-MA,  and  ONR 
under  grant  N00014 1210765. 

N.  M.  Nasrabadi  is  with  U.S.  Army  Research  Laboratory,  Adelphi,  MD 
20783  USA  (e-mail:  nnasraba@arl.army.mil). 


labelled  samples.  Additionally,  most  HSI  pixels  are  indiscrim- 
inative  since  they  are  undesirably  highly  coherent  |6|.  In  the 
past  few  decades,  numerous  classification  techniques,  such  as 
SVM  [7],  k-nearest-neighbor  classifier^,  multimodel  logistic 
regression  |9]  and  neural  network  fT0|7  have  been  proposed 
to  alleviate  some  of  these  problems  to  achieve  an  acceptable 
performance  for  HSI  classification. 

A.  Sparse  Representation  for  HSI  classification 

More  recently,  researchers  have  focused  attention  on  de¬ 
scribing  the  high  dimensional  data  as  a  sparse  linear  com¬ 
bination  of  dictionary  atoms.  Sparse  representation  classifier 
(SRC)  was  proposed  in  GD  and  has  been  successfully  applied 
to  a  wide  variety  of  applications,  such  as  face  recognition  GD- 
visual  tracking  GD  speech  recognition  GD  and  aerial  image 
detection  GD-  SRC  has  also  been  applied  to  HSI  classification 
by  Chen  et.  al.  GD  where  a  dictionary  was  constructed  by 
stacking  all  the  labelled  samples.  Success  of  SRC  requires  that 
the  high  dimensional  data  belonging  to  the  same  class  to  lie 
in  a  low  dimensional  subspace.  The  outstanding  classification 
performance  is  due  to  the  robustness  of  sparse  recovery,  which 
is  largely  provided  by  the  high  redundancy  and  low  coherency 
of  the  dictionary  atoms.  A  low  reconstruction  error  and  a 
high  sparsity  level  can  be  achieved  if  the  designed  dictionary 
satisfies  the  above  properties.  Unfortunately,  in  practice,  the 
HSI  dictionary  usually  does  not  have  the  above  properties  due 
to  the  small  number  of  bluehighly  correlated  labelled  training 
samples  ©• 

Due  to  these  undesired  properties  of  the  HSI  dictionary,  the 
sparse  recovery  can  become  unstable  and  unpredictable  such 
that  even  pixels  belonging  to  the  same  class  can  have  totally 
different  sparse  codes.  The  problem  induced  by  the  high- 
coherency  of  the  dictionary  atoms,  which  can  be  alleviated 
through  decreasing  the  variation  between  the  sparse  codes 
of  the  hyperspectral  pixels  that  belong  to  the  same  class. 
In  HSI,  pixels  that  are  spatially  close  to  each  other  usually 
have  similar  spectral  features  and  belong  to  the  same  class. 
Previous  research  has  shown  that  the  sparse  codes  of  neigh¬ 
boring  pixels  can  become  similar  by  enforcing  a  structured 
sparsity  constraint  (prior).  The  simultaneous  sparse  recovery 
is  analytically  guaranteed  to  achieve  a  sparser  solution  and 
a  lower  reconstruction  error  with  a  smaller  dictionary  fT6|. 
A  variety  of  structured  sparsity  priors  are  proposed  in  the 
literature  GD  that  are  capable  of  generating  different  desired 
sparsity  patterns  for  the  sparse  codes  of  neighboring  pixels. 
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The  joint  sparsity  prior  GD  assumes  that  the  features  of  all  the 
neighboring  pixels  lie  in  the  same  low  dimensional  subspace 
and  all  the  corresponding  sparse  codes  share  the  same  set 
of  dictionary  atoms.  Therefore,  the  sparse  codes  have  a  row 
sparsity  pattern,  where  only  a  few  rows  of  the  sparse  codes  are 
nonzero  (18),  (T9).  The  collaborative  group  sparsity  prior  (20) 
enforces  the  coefficients  to  have  a  group-wise  sparsity  pattern, 
where  the  coefficients  within  each  active  group  are  dense. 
The  collaborative  hierarchical  sparsity  prior  ED  enforces  the 
sparse  codes  to  be  not  only  group-wise  sparse,  but  also  sparse 
within  each  active  group.  The  low  rank  prior  (22)  assumes 
that  the  neighboring  pixels  are  linearly  dependent.  It  does 
not  necessary  lead  the  coefficients  to  be  sparse,  which  is 
detrimental  for  a  good  classification.  However,  the  low  rank 
group  prior  proposed  in  ED  is  able  to  enforce  both  a  group 
sparsity  prior  and  a  low  rank  prior  on  the  sparse  codes  by 
forcing  the  same  group  of  dictionary  atoms  to  be  active  if 
and  only  if  the  corresponding  neighboring  pixels  are  linearly 
dependent.  The  Laplacian  sparsity  prior  (23)  uses  a  Laplacian 
matrix  to  describe  the  degree  of  similarity  between  the  neigh¬ 
boring  pixels.  The  neighboring  pixels  that  have  less  spectral 
features  in  common  are  less  encouraged  to  have  a  similar 
sparse  codes.  It  has  been  shown  that  all  the  structured  sparsity 
priors  are  capable  of  obtaining  a  smoother  classification  map 
and  improving  the  classification  performance  ED- 


from  different  classes.  It  increases  the  discriminability  of  the 
sparse  codes  by  decreasing  the  coherency  of  the  atoms  from 
different  classes.  The  label  consistent  K-SVD  (LC-KSVD) 
ED  optimizes  the  dictionary  and  classifier’s  parameter  by 
minimizing  the  summation  of  reconstruction  and  classification 
errors.  It  combines  the  dictionary  and  classifier’s  parameter 
into  a  single  parameter  space,  which  makes  it  possible  for  the 
optimization  procedure  to  be  much  simpler  than  those  used  in 
classical  SRC.  However,  a  desired  and  accurate  solution  is  not 
guaranteed  (32)  because  the  cost  function  can  be  minimized 
by  decreasing  the  reconstruction  error  while  the  classification 
error  is  increased.  A  bilevel  optimization  formulation  would 
be  more  appropriate  (33),  where  the  update  of  the  dictionary  is 
driven  by  the  minimization  of  the  classification  error.  The  task- 
driven  dictionary  learning  (TDDL)  (27)  exploits  this  idea  with 
theoretical  proof  and  demonstrates  a  superior  performance. 
The  supervised  translation-invariant  sparse  coding,  which  uses 
the  same  scheme  as  TDDL,  is  developed  independently  by 
(34).  It  is  a  more  general  framework  that  can  be  applied  not 
only  to  classification,  but  also  nonlinear  image  mapping,  digi¬ 
tal  art  authentication  and  compressive  sensing.  More  recently, 
the  group  sparsity  prior  is  enforced  on  a  single  measurement 
and  the  corresponding  TDDL  optimization  algorithm  is  devel¬ 
oped  in  (35)  in  order  to  improve  the  performance  of  region 
tagging. 


B.  Dictionary  Learning  for  Sparse  Representation 

In  the  classical  SRC,  the  dictionary  is  constructed  by 
stacking  all  the  training  samples.  The  sparse  recovery  can  be 
computationally  burdensome  when  the  training  set  is  large. 
Besides,  the  dictionary  constructed  in  this  manner  can  neither 
be  optimal  for  reconstruction  purposes  nor  for  classification 
of  signals.  Previous  literature  have  shown  that  a  dictionary 
can  be  trained  to  have  a  better  representation  of  the  dataset. 
Unsupervised  dictionary  learning  methods,  such  as  the  method 
of  optimal  direction  (MOD)  (24),  K-SVD  [251  and  online  dic¬ 
tionary  learning  (26),  are  able  to  improve  the  signal  restoration 
performance  of  numerous  applications,  such  as  compressive 
sensing,  signal  denoising  and  image  inpainting. 

However,  the  unsupervised  dictionary  learning  method  is 
not  suitable  for  solving  classification  problems  since  a  lower 
reconstruction  error  does  not  necessarily  lead  to  a  better  classi¬ 
fication  performance.  In  fact,  it  is  observed  that  the  dictionary 
can  have  an  improved  classification  result  by  sacrificing  some 
signal  reconstruction  performance  (27).  Therefore,  supervised 
dictionary  learning  methods  [28]  are  proposed  to  improve 
the  classification  result.  Unlike  the  unsupervised  dictionary 
learning,  which  only  trains  the  dictionary  by  pursuing  a 
lower  signal  reconstruction  error,  the  supervised  learning  is 
able  to  directly  improve  the  classification  performance  by 
optimizing  both  the  dictionary  and  classifier’s  parameter  si¬ 
multaneously.  The  discriminative  dictionary  learning  in  (29) 
minimizes  the  classification  error  of  SRC  by  minimizing  the 
reconstruction  error  contributed  by  the  atoms  from  the  correct 
class  and  maximizing  the  error  from  the  remaining  classes. 
The  incoherent  dictionary  learning  in  [30)  uses  SRC  as  the 
classifier  and  tries  to  eliminate  the  atoms  shared  by  pixels 


C.  Contributions 


In  this  paper,  we  propose  a  novel  method  that  enforces 
the  joint  or  Laplacian  sparsity  prior  on  the  sparse  recovery 
stage  of  TDDL.  The  existing  dictionary  learning  methods 
have  only  been  developed  for  reconstructing  or  classifying  a 
single  measurement.  Therefore,  it  is  advantageous  to  incorpo¬ 
rate  structured  sparsity  priors  into  the  supervised  dictionary 
learning  in  order  to  achieve  a  better  performance.  This  paper 
makes  the  following  contributions: 

•  We  propose  a  new  dictionary  learning  algorithm  for 
TDDL  with  joint  or  Laplacian  sparsity  in  order  to  ex¬ 
ploit  the  spatial- spectral  information  of  HSI  neighboring 
pixels. 

•  We  show  experimentally  that  the  proposed  dictionary 
learning  methods  have  a  significantly  better  performance 
than  SRC  even  when  the  dictionary  is  highly  compact. 

•  We  also  describe  an  optimization  algorithm  for  solving 
the  Laplacian  sparsity  recovery  problem.  The  proposed 
optimization  method  is  much  faster  than  the  modified 
feature  sign  search  used  in  (23). 


The  remainder  of  the  paper  is  organized  as  follows.  In 
Section  [II]  a  brief  review  of  TDDL  is  given.  In  Section  [III] 
we  propose  a  modified  TDDL  algorithm  with  the  joint  sparsity 
prior.  TDDL  with  the  Laplacian  prior  and  a  new  algorithm  for 
recovering  the  Laplacian  sparse  problem  are  stated  in  Section 
IV  In  Section  [V]  we  show  that  our  method  is  superior  to 


other  HSI  classification  methods  through  experimental  results 
on  several  HSI  images.  Finally,  we  provide  our  conclusion  in 
Section  lYll 
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II.  Task-driven  Dictionary  Learning 

In  TDDL  (27),  signals  are  represented  by  their  sparse 
codes,  which  are  then  fed  into  a  linear  regression  or  logistic 
regression.  Consider  a  pair  of  training  samples  (x,  y),  where 
x  G  Mm  is  the  HSI  pixel,  M  is  the  number  of  spectral  bands, 
and  y  G  RK  is  a  binary  vector  representation  of  the  label  of 
the  sample  x.  K  is  the  maximum  class  index.  Pixel  x  can  be 
represented  by  a  sparse  coefficient  vector  a(D,  x)  G  RN  with 
respect  to  some  dictionary  D  G  MMxAr  consisting  of  N  atoms 
by  solving  the  optimization 

a(D,x)  =  argmin||x-Dz|||  +  A||z||i  +  ^||z|||,  (1) 

z  2 

, where  A  and  e  are  the  regularization  parameters.  A  controls 
the  sparsity  level  of  the  coefficients  a.  In  our  experiments, 
we  set  e  to  0  since  it  does  not  affect  the  convergence  of  the 
algorithm  and  always  gives  satisfactory  results. 

To  optimize  the  dictionary,  TDDL  first  defines  a  convex 
function  £(D,  W,  {x^}^1)  to  describe  the  classification  risk 
in  terms  of  the  dictionary  atoms,  sparse  coefficients  and  the 
classifier’s  parameter  W.  The  function  is  then  minimized  as 
follows 

min£(D>w-  {xi}?=i)  =  min/(D,W,{xi}f=1)+^||W|||., 

D.  W  13.  W  2 

(2) 

where  p  >  0  is  a  classifier  regularization  parameter  to  avoid 
overfitting  of  the  classifier  (36).  The  convex  function  /  is 
defined  as 


1  5 

/(D,W,{xjf=1)  =  J(yi,'W,ai(p,xi)), 


(3) 


where  S  is  the  total  number  of  training  samples  and 
£(y^,  W,  a^(D,  Xi))  is  the  classification  error  for  a  training 
pair  which  is  measured  by  a  linear  regression,  i.e. 

J( Yi,  W,  a<(D,  x,))  =  1 1 1 y i  -  Wa, ||§. 

In  the  following  part  of  the  section,  we  omit  the  subscript 
i  of  a  for  notational  simplicity.  The  dictionary  D  and  the 
classifier  parameter  W  are  updated  using  a  stochastic  gradient 
descent  algorithm,  which  has  been  independently  investigated 
by  (27),  (34).  The  update  rules  for  D  and  W  are 


(4) 


( D^t+1)  =  DW  -  p®  ■  dC^/d D, 
jW(t+i)  =  ww  _  p(t) .  acw/aw, 


where  t  is  the  iteration  index  and  p  is  the  step  size.  The 
equations  for  updating  the  classifier  parameter  W  is  straight¬ 
forward  since  £(y,  W,  a(D,  x))  is  both  smooth  and  convex 
with  respect  to  W.  We  have 


dC 

dW 


=  (Wet  -  y)  aT  +  //W. 


(5) 


The  updating  equation  for  the  dictionary  can  be  obtained  by 
applying  error  backpropagation,  where  the  chain  rule  is  applied 

dC  dC  da 
m  da  <9D 

The  difficulty  of  acquiring  a  specific  form  of  the  above  equa¬ 
tion  comes  from  doL/dr>.  Since  the  sparse  coefficient  a(D,x) 
is  an  implicit  function  of  D,  an  analytic  form  of  a  with  respect 


(6) 


to  D  is  not  available.  Fortunately,  the  derivative  doc/dr>  can  still 
be  computed  by  either  applying  optimality  condition  of  elastic 
net  (27),  [37 ]]  or  using  fixed  point  differentiation  (34),  [38]. 

We  now  focus  on  computing  the  derivative  using  the  fixed 
point  differentiation.  As  suggested  in  (38),  the  gradient  of  Eq. 
0  reaches  0  at  the  optimal  point  a 

d|lx  —  Doj||  =  AgH|i 

da  ot=6t  da 

Expanding  Eq.  ([7]),  we  have 


2Dt(x  —  Da) 


=  A  •  sign(a) 

ol=6l 


at=6t 


(8) 


In  order  to  evaluate  da/ar> ,  the  derivative  of  Eq.  ^  with 
respect  to  each  element  Dmn  of  the  dictionary  is  required. 
Since  the  differentiation  of  the  sign  function  is  not  well 
defined  at  zero  points,  we  can  only  compute  the  derivative 
of  Eq.  (8j)  at  fixed  points  when  a[n]  ^  0  (34) 


da  a 
dDrnn 


(DlDA)- 


<9D^x 

dDmn 


SD^Da  \  daAc 

dDmn  J  dDmn 

(9) 


where  A  and  Ac  are  the  indices  of  the  active  and  inactive 
set  of  a  respectively.  Dmn  G  f  is  the  (m,  n)  element  of  D. 
(DIDa)-1  is  always  invertible  since  the  number  of  active 
atoms  |A|  is  always  much  smaller  than  the  feature  dimension 
M. 


III.  TDDL  WITH  JOINT  SPARSITY  PRIOR 

We  now  extend  TDDL  by  using  a  joint  sparsity  (JS)  prior 
(TDDL-JS).  The  joint  sparsity  prior  (18),  (19)  enforces  the 
sparse  coefficients  of  the  test  pixel  and  its  neighboring  pixels 
within  the  neighborhood  window  to  have  row  sparsity  pattern, 
where  all  pixels  are  represented  by  the  same  atoms  in  the 
dictionary  so  that  only  few  rows  of  the  sparse  coefficients 
matrix  are  nonzero.  The  joint  sparse  recovery  can  be  solved 
by  the  following  Lasso  problem 

A  =  argmin  ||X  -  BZ\\2F  +  A||Z||li2,  (10) 

where  A,  Z  G  MiVxP  are  sparse  coefficient  matrices  and 
X  =  [xi,...,xp]  G  RMxP  represents  all  the  pixels  within  a 
neighborhood  window  centered  on  a  test  (center)  pixel  xc.  De¬ 
fine  the  label  of  the  center  pixel  as  yc.  P  is  the  total  number  of 

p 

pixels  within  the  neighborhood  window.  ||Z||i?2  = 

i=l 

is  the  ^i?2-norm  of  Z.  Z i  G  MlxP  is  the  ith  row  of  Z.  Many 
sparse  recovery  techniques  are  able  to  solve  Eq.  (JTOj,  such  as 
the  Alternating  Direction  Method  of  Multipliers  (39),  Sparse 
Reconstruction  by  Separable  Approximation  (SpaRSA)  (40) 
and  Fast  Iterative  Shrinkage-Thresholding  Algorithm  (FISTA) 


Once  the  sparse  code  A  is  obtained,  the  sparse  codes  ac 
of  the  center  pixel  xc  is  projected  on  each  of  the  K  decision 
planes  of  the  classifier.  The  plane  with  the  largest  projection 
indicates  the  class  that  the  center  pixel  xc  belongs  to, 

identity  (xc)  =  argmaxy/c  =  argmax(Wac)/c,  (11) 

k  k 
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where  ac  G  RN  is  the  sparse  coefficients  of  the  center  pixel. 
In  the  training  stage,  it  is  expected  that  the  projection  of  the 
decision  plane  corresponding  to  the  class  of  the  center  pixel 
should  be  increased  while  other  planes  should  be  orthogonal 
to  olc.  Therefore,  given  the  training  data  (X,yc),  the  classifi¬ 
cation  error  for  the  center  pixel  xc  is  defined  as 

£(yc,W,ac(D,X))  =  ||yc  —  Wac||l  +  |||W|||.,  (12) 

In  order  to  update  the  dictionary  D,  we  need  to  apply  a  chain 
rule  similar  to  the  one  in  Eq. 

dC  dC  d  A 

(13) 


3D  3  A  3D 

dA 


Now  we  focus  on  the  difficult  part  ^  of  Eq 
the  fixed  point  differentiation  on  Eq.  ([TO]),  we 


e  ha1 


IX  —  DA 


dA 


=  -A 


0||A| 


1,2 


A=A 


(14) 


0,  the  derivative  is  not  well  defined,  so  we  set 


dA  i 


LiVA 


Dt  (x  -  Da)  =  A 

v  )  LIIAill2  IIajva 

Computing  the  derivative  of  Eq.  ©  with  respect  to  D 


(15) 


and  transposing  both  sides 

0  |(X  —  DA)t  D  j 


0D„ 


wl,erli'-  =  ire 

Eq.  |16j),  we  have 
(dXTt) 

vec  — - A 

y  dDmn 


where  r  =  T  | 


=  A 


A,1  A; 


ri 


dAj 

dDn 


T 


■  NA 


d^-NA 

3D 


Algorithm  1  Stochastic  gradient  descent  algorithm  for  task- 
driven  dictionary  learning  with  joint  sparsity  prior 

Require:  Initial  dictionary  D  and  classifier  W.  Parameter  A, 
p  and  to. 

for  t  =  1  to  T  do 

Draw  one  sample  (X,  yc)  from  training  set. 


5: 


Find  sparse  sparse  code  A  according  to  Eq.  ([TO]) 
Find  the  active  set  A  and  define  NA  =  |A| 

A  {i  :  ||  || 2  7^  0,  i  G  {1, . . . ,  N }}, 

where  A^  is  the  ith  row  of  A. 

Compute  r  G  ’RN^PxN^p 


.  Employing 
ave 


r  =  iv 


r?  = 


9  Tata, 

a,A7 


A=A  3A 

In  the  following  part  of  this  section,  we  omit  the  fixed  point 
notation.  Eq.  (14)  is  only  differentiable  when  || A^ || 2  7^  0, 
where  A^  denotes  the  ith  row  of  A.  At  points  where  ||  A; 


llAiiii 


,  i  15 


,JVa, 


6: 


1 2  = 

=  0. 

Denote  A  =  A  a  G  RJNaXP,  where  A  is  the  active  set  such 
that  A  =  {i  :  ||  A^ ||2  +  0,i  G  {1, . . .  ,jV}},  NA  =  |A|,  Aa  is 
composed  of  active  rows  of  A,  and  D  is  the  active  atoms  of 
D.  Expanding  the  derivative  of  Eq.  (|T4])  on  both  sides  on  the 
feasible  points, 

AT  nT 


7: 


llAiHa 

where  ®  is  the  direct  sum  of  matrices. 

Compute  7  G  RNaP 

7  =  (DtD  01 P  +  Ar)_Tuec((WA  -  Y)TW). 

where  vec(-)  and  W  denote  the  vectorization  operator 
and  A  columns  of  W  respectively. 

Let  (3  G  RNxP.  Set  f3Ac  =  0  and  construct  (3A  G 
RNaxp  fast  satisfies 


vec 


(A) 


=  7- 


(16) 

i  =  1, . . . ,  Na.  By  vectorizing 


8:  Choose  the  learning  rate  pt  min(p,  Py). 

9:  Update  the  parameters  by  gradient  projection  step 

W  ^  W  -pt  ((Woe  -  y )a J  +  /iW) , 

D  <-  D  -  pt(- D/3At  +  (X  -  DA)/3t), 

and  normalize  every  column  of  D^+1^  with  respect  to 
^2-norm. 

10:  end  for 

it:  return  D  and  W. 


<9DtD  <9At 


dD„ 


dDn 


dtd 


=  A  •  Tvec 


3AT 


TDDL  WITH  LAPLACIAN  SPARSITY  PRIOR 


3DmnThe  joint  sparsity  prior  is  a  relatively  stringent  constraint  on 


■  Yy- 


(17) 


From  Eq.  (17),  we  reach  the 


vectorization  form  of  the  derivative  of  A  with  respect  to  Dn 
given  as 


vec  \ 


3AT 

dDmn 


(61 


D  0  Ip  +  XT 


-1 


vec  \ 


3DD 

dDmn 


(18) 

Now  we  can  update  the  dictionary  element-wise  using  Eq. 
(p~8).  In  order  to  reach  a  more  concise  form  for  updating 
the  dictionary,  we  perform  algebraic  transformations  on  Eq. 
(p~3)  and  Eq.  (fl~8]),  which  are  illustrated  in  Appendix  |VII|  We 
illustrate  the  overall  optimization  for  TDDL-JS  in  Algorithm 
[I]  It  should  be  noted  that  in  the  Algorithm  [l]  we  define  A  = 
[0, . . .  ,ac, . . .  ,0]  G  RNxP  and  Y  =  [0, . . . ,  y, . . . ,  0]  G 

jrpKxP 


the  sparse  codes  since  it  assumes  that  all  the  neighboring  pixels 
have  the  same  support  as  the  center  pixel.  The  assumption 
of  the  joint  sparsity  prior  can  easily  be  violated  on  non- 
homogeneous  regions,  such  as  a  region  that  contains  pixels 
from  different  classes.  This  makes  choosing  a  proper  neighbor- 
<9Koo6>^indow  size  a  difficult  problem.  When  the  window  size 
dD^QQ  warge,  the  sparse  codes  of  the  non-homogeneous  regions 
within  the  window  are  indiscriminative.  On  the  other  hand,  the 
sparse  codes  are  not  stable  if  the  window  size  is  chosen  to  be 
too  small.  Ideally,  we  hope  that  the  performance  is  insensitive 
to  both  the  choice  of  the  window  size  and  the  topology  of  the 
image.  To  achieve  this  requirement,  we  propose  to  enforce  the 
Laplacian  sparsity  (LP)  prior  (TDDL-LP)  on  the  TDDL,  where 
the  degree  of  similarity  between  neighboring  pixels  can  be 
utilized  to  push  the  sparse  codes  of  the  neighboring  pixels  that 
belong  to  the  same  class  to  be  similar,  instead  of  enforcing  all 
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the  neighboring  pixels  to  have  a  similar  sparse  codes  blindly. 
The  corresponding  Lasso  problem  can  be  stated  as  follows 

p 

A  =  argmin  ||X  -  BZ\\22  +  A||Z||i  +  7^cjj||Zi  -  2?\\l 

i,3 

(19) 

where  Z2  and  ZJ  denote  the  ith  and  jth  columns  of  Z.  Cij  is 
a  weight  whose  value  is  proportional  to  the  spectral  similarity 
of  X2  and  XJ,  which  are  the  ith  and  jth  columns  of  X.  7  is  a 
regularization  parameter. 

The  Laplacian  sparse  recovery  described  by  Eq.  ©  in 
(23)  is  able  to  discriminate  pixels  from  different  classes  by 
defining  an  appropriate  weighting  matrix  C  =  [c^]  G  MPxP. 
Additionally,  it  enforces  both  the  support  and  the  magnitude 
of  sparse  coefficients  of  similar  spectral  pixels  to  be  similar, 
whereas  the  joint  sparsity  prior  enforces  sparse  coefficients  of 
all  the  pixels  within  the  neighborhood  window  to  have  the 
same  support.  Eq.  ©  can  be  reformulated  as 

A  =  argmm  ||X  -  DZ|||  +  A||Z||i  +  7tr(ZLZT),  (20) 

where  L  =  B  —  C  G  MPxP  is  the  Laplacian  matrix  (42). 
B  =  [bij]  G  MPxP  is  a  diagonal  matrix  such  that  bn  =  cij- 

3 

In  this  paper,  we  adopt  the  method  of  Sparse  Reconstruction 
by  Separable  Approximation  (SpaRSA)  (TT),  (40)  to  solve  the 
Laplacian  sparse  coding  problem. 


A.  Sparse  Recovery  Algorithm 

A  modified  feature  sign  search  (23)  is  capable  of  solving 
the  optimization  problem  ([20]).  It  uses  coordinate  descent  to 
update  each  column  of  A  iteratively.  Although  it  gives  plausi¬ 
ble  performance  for  the  SRC -based  HSI  classification  © 
it  demands  a  high  computational  cost.  The  SpaRSA-based 
method  can  achieve  a  similar  optimal  solution  of  Eq.  ([20]) 
while  being  less  computational  burdensome.  Despite  the  fact 
that  our  previous  work  ©  has  shown  that  the  performance  of 
the  SRC-based  approach  for  HSI  classification  can  be  largely 
influenced  by  the  choice  of  specific  optimization  technique,  we 
found  that  such  influence  is  reasonably  small  when  employing 
the  dictionary-learning-based  approach.  Therefore,  we  use 
a  SpaRSA-based  method  to  solve  the  sparse  recovery  for 
the  Laplacian  sparsity  prior.  Although,  SpaRSA  is  originally 
designed  to  solve  the  optimization  of  single-signal  case,  it  can 
be  easily  extended  to  tackle  the  problem  with  multiple  signals, 
such  as  the  collaborative  hierarchical  Lasso  (C-Hilasso)  © 

SpaRSA  is  able  to  solve  optimization  problems  that  have 
the  following  form 


min  /  (A) 

AeR^x-P  j 


•AV-(A), 


(21) 


where  /  :  M7VxP  g  M  is  a  convex  and  smooth  function, 
ip  :  RNxP  M  is  a  separable  regularizer  and  A  is  the 
regularization  parameter.  In  the  particular  case  of  the  Laplacian 
sparse  recovery,  the  regularizer  ip  is  chosen  to  be  the  i\—  norm, 
i.e.  ip( A)  =  || A ||i,  and  the  convex  function  /  is  set  as 

/  (A)  =  ||X  —  DA||p  +  ■ytr  (ALA1") .  (22) 


Algorithm  2  Sparse  recovery  for  Laplacian  sparsity  prior 
using  SpaRSA 

Require:  Dictionary  D,  constants  po  >0,  0  <  rjmin  <  r/max, 

p  >  1 

l:  Set  t  =  0  and  A^0^  =  0 

2:  repeat 

3:  choose  T](  ^  G  [Pmim  ?7max] 

4:  compute  UW  <-  AW  -  ^V/( AW). 

5:  repeat 

6:  A«  <-  S^_  (UW)  , 

7:  rjW  <—  pp^\ 

8:  until  stopping  criterion  is  satisfied 

9:  t  i —  t  T"  1. 

10:  until  stopping  criterion  is  satisfied 

ll:  return  The  optimal  sparse  coefficients  A*. 


In  order  to  search  the  optimal  solution  of  Eq.  ([21),  SpaRSA 
generates  a  sequence  of  iterations  A®,  t  =  1,2,...,  by 
solving  the  following  subproblem 


A<t+1)  Earg  minp  (z  -  A^)  V/(A^)  +  ^-  ||Z-A^  (Z) , 

(23) 

where  rj^  >  0  is  a  nonnegative  scalar  such  that  rj^  = 
and  p  >  1.  The  Eq.  (23)  can  be  simplified  into  the 
following  form  by  eliminating  the  terms  independent  of  Z 


1 


mm 


^|Z-U«|||  +  7U(Z) 


« 


(24) 


zemJVxP  2  "*  py 

where  =  A® - 4jV/(A^).  The  optimization  problem 

in  Eq.  ([24])  is  separable  element-wise,  which  can  be  reformu¬ 
lated  into 

min  I(%-w-1)2+-A^(z),  Vi  s  1, . . . ,  N  and  j  =  1, . . . ,  P. 

A-ij  a  py  ' 

(25) 

The  problem  in  Eq.  ([25])  has  a  unique  solution  and  can  be 
solved  by  the  well-known  soft  thresholding  operator  S(-) 


zij  =  S^J  (4?)  =  sign(Uij)  max{0,  \uij\  -  A}.  (26) 

Comparing  with  the  algorithm  proposed  in  (23),  which  is 
based  on  the  coordinate  descent,  Laplacian  sparse  recovery 
using  SpaRSA  is  more  computationally  efficient  since  it  is 
able  to  cheaply  search  for  a  better  descent  direction  V/( A). 
The  corresponding  optimization  is  stated  in  Algorithm  [2] 


B.  Dictionary  Update 


In  order  to  adjust  the  dictionary,  we  now  follow  Eq.  © 
to  derive  using  the  fixed  point  differentiation.  Applying 
differentiation  on  Eq.  ([19])  on  the  fixed  point  A 


d||X-DA||!+7 tr  (ALAt) 


dA 


=  -X- 


A=A 


dA 


\A= 


hi) 


In  the  following  part,  we  omit  the  fixed  point  notation.  By 
computing  the  derivation  and  then  applying  the  vectorization 
on  Eq.  ([27),  we  have 


vec  (Dt  (X  -  DA)  -  7AL)  =  A  •  vec  ( sign  (A)) .  (28) 


(22) 
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Algorithm  3  Stochastic  gradient  descent  algorithm  for  task- 
driven  dictionary  learning  with  Laplacian  sparsity  prior 


Require:  Initial  dictionary  D  and  classifier  W.  Parameter  A, 
p  and  to. 

for  t  =  1  to  T  do 

Draw  one  sample  (X,  yc)  from  training  set. 


Find  sparse  code  A  according  to  Eq.  (IQ). 
Find  the  active  set  A 


A  <-  {i  :  vec(A)i  ^  0,  i  G  {1, . . . ,  NP }}, 

where  vec(A)i  is  the  ith  element  of  vec(A). 

5:  Let  (3  G  RNxP.  Set  vee(/3)Ac  =  0  and  compute 

vec(f3)A 

vec(f3)A  =  (IP(g)DTD+7L(8)IAr)X>ec(WT(WA-Y)) 

and  (g)  denotes  the  Kronecker  product. 

6:  Choose  the  learning  rate  pt  <—  min(p,  Py). 

7:  Update  the  parameters  by  gradient  projection  step 

W  ^  W  -Pt  ((Wac  -  y )aj  +  fiW) , 

D  •<-  D  —  pt(— D/3At  +  (X  -  DA)/3t), 


(initial  step  size),  N  (dictionary  size)  and  P  (number  of 
neighboring  pixels),  would  introduce  significant  computational 
cost.  Instead,  we  search  for  the  optimal  values  for  the  above 
parameters  according  to  the  following  procedure. 

•  The  candidate  dictionary  sizes  are  from  5  to  10  atoms 
per  class.  The  choice  of  dictionary  size  depends  on  the 
classification  performance  and  computational  cost.  In  our 
experiment,  we  set  the  dictionary  size  to  be  5  atoms  per 
class. 

•  Searching  for  the  optimal  window  size  and  the  regulariza¬ 
tion  parameters  would  be  cumbersome.  Empirically,  we 
found  that  the  optimal  regularization  parameters  are  less 
likely  to  be  affected  by  the  choice  of  the  window  size. 
Therefore,  for  each  image,  we  fix  the  window  size  to  be 

5  3  x  3  in  order  to  save  computational  resource  during  the 

search  of  the  optimal  regularization  parameters.  Candi¬ 
date  regularization  parameters  are  {l0-3,10-2,10-1}. 

•  The  possible  candidate  window  sizes  are  3  x  3,  5x5, 
7x7  and  9x9.  We  search  for  the  optimal  window  size 
for  each  image  after  finding  the  optimal  regularization 
parameters. 


and  normalize  every  column  of  with  respect  to 

£2 -norm. 

8:  end  for 

9:  return  D  and  W. 


The  differentiation  dvec^n{A))  not  weq  defined  on  zero 
points  of  vec  ( sign  (A)).  Similar  as  in  TDDL-JS,  we  set  the  ith 


element - ^  1  ])i  =  0  when  vec  ( sign  (A))  •  =  0.  Denote 

the  A  as  the  index  set  of  nonzero  elements  of  vec  ( sign  (A)). 
Compute  the  derivative  of  Eq.  ([28])  with  respect  to  Dmn 


d  {vec  (Dt  (X  -  DA)  -  7AL)  a} 


dDn 


which  leads  to 

SDtD 


vec 


dDr . 


A- 


<9DtX 

c)D 


D  D 


<9A 

0Drn, 


=  0, 


_dA_ 

'dDrnr 


(29) 


=  0. 

60) 


Now  we  reach  the  desired  gradient 
<9A 


vec 


(ip' 


dD„ 


DtD 


■  7L  ( 


By  applying  algebraic  simplification  to  Eq.  ([3l]),  which  is 
shown  in  Appendix  |VII|  we  reach  the  optimzation  for  TDDL- 
LP  as  stated  in  the  Algorithm  [3]  It  should  be  noted  that  A 
and  Y  have  the  same  definitions  as  those  in  Algorithm  [l] 


V.  EXPERIMENTS 


A.  Experiment  Setup 

Cross-validation  to  obtain  the  optimal  values  for  all  pa¬ 
rameters,  including  A,  e,7  (sparse  coding  regularization  pa¬ 
rameters),  p  (regularization  parameter  for  the  classifier),  po 


TABLE  I:  Parameters  Used  in  the  Paper 


Structured  Priors 

A 

7 

p 

£1 

icr2 

10“2 

JS 

(M 

1 

O 

1— 1 

10“3 

LP 

IQ”2 

10-a 

IQ-1 

Computing  the  gradient  for  a  single  training  sample  at  each 
iteration  of  Algorithm  [T]  or  [3]  will  make  the  algorithm  converge 
very  slowly.  Therefore,  following  the  previous  work  (25),  (27), 
we  implement  the  two  proposed  algorithms  with  the  mini¬ 
batch  method,  where  the  gradients  of  multiple  training  samples 
are  computed  in  each  iteration.  For  the  unsupervised  learning 
methods,  the  batch  size  is  set  to  200.  For  the  supervised  learn¬ 
ing  methods,  the  batch  size  is  set  to  100  and  to  =  T / 10.  We 
search  the  optimal  regularization  parameters  for  each  image 
and  found  that  their  optimal  values  are  coincidentally  the  same. 
The  reason  could  be  due  to  our  choice  of  a  large  interval 
for  the  search  grid.  The  regularization  parameters  used  in  our 
paper  are  shown  in  Table  |l|  We  set  p  =  10-4.  As  a  standard 
procedure,  we  evaluate  the  classification  performance  on  HSI 
image  using  the  overall  accuracy  (OA),  average  accuracy  (A A) 
and  kappa  coefficient  (k).  The  classification  methods  that  are 
tested  and  compared  are  SVM,  SRC,  SRC  with  joint  sparsity 
prior  (SRC-JS),  SRC  with  Laplacian  sparsity  prior  (SRC- 
LP),  unsupervised  dictionary  learning  (ODL),  unsupervised 
dictionary  learning  with  joint  or  Laplacian  sparsity  prior 
(ODL-JS,  ODL-LP),  TDDL,  TDDL-JS  and  TDDL-LP.  During 
the  testing  stage,  all  training  pixels  are  excluded  from  the  HSI 
image,  which  means  there  may  be  some  ‘holes’  (training  pixels 
deleted)  inside  a  neighborhood  window.  This  is  reasonable 
since  we  do  not  want  the  classification  results  to  be  affected 
by  the  spatial  distribution  of  the  labelled  samples.  We  use 
SPAMS  toolbox  (43)  to  perform  the  joint  sparse  recovery 
via  the  Fast  Iterative  Shrinkage-Thresholding  Algorithm  [41]. 
The  sparse  recovery  for  SRC-based  methods  are  performed 
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via  the  Alternating  Direction  Method  of  Multipliers  (39).  The 
modified  SpaRSA  shown  in  Algorithm  [2]  is  used  to  solve  the 
Laplacian  sparse  recovery  problem. 

For  the  unsupervised  dictionary  learning  methods,  the  dic¬ 
tionary  is  initialized  by  randomly  choosing  a  subset  of  the 
training  pixels  from  each  class  and  updated  using  the  online 
dictionary  learning  (ODL)  procedure  in  (26).  The  classifier’s 
parameter  are  then  obtained  by  using  a  multi-class  linear 
regression.  For  the  supervised  dictionary  learning  methods, 
the  dictionary  and  classifier’s  parameter  are  initialized  by  the 
training  results  of  ODL  for  the  unsupervised  method. 


Fig.  3:  The  effect  of  different  window  sizes  for  the  Indian  Pine 
image.  The  dictionary  size  is  fixed  at  five  atoms  per  class. 


(a)  (b) 


Fig.  1:  (a)  Training  sets  and  (b)  test  sets  of  the  Indian  Pine 
image. 


TABLE  II:  Number  of  training  and  test  samples  for  the  Indian 
Pine  image 


Class  # 

Name 

Train 

Test 

1 

Alfalfa 

6 

48 

2 

Corn-notill 

137 

1297 

3 

Corn-min 

80 

754 

4 

Corn 

23 

211 

5 

Grass/Pasture 

48 

449 

6 

Grass/Trees 

72 

675 

7 

Grass/Pasture-mowed 

3 

23 

8 

Hay-windrowed 

47 

442 

9 

Oats 

2 

18 

10 

Soybeans-notill 

93 

875 

11 

Soybeans-min 

235 

2233 

12 

Soybean-clean 

59 

555 

13 

Wheat 

21 

191 

14 

Woods 

124 

1170 

15 

Building-Grass-Trees-Drives 

37 

343 

16 

Stone-steel  Towers 

10 

85 

Total 

997 

9369 

Fig.  2:  The  result  with  different  dictionary  sizes  for  the  Indian 
Pine  image. 


B.  Classification  on  AVIRIS  Indian  Pine  Dataset 


We  first  perform  HSI  classification  on  the  Indian  Pine  im¬ 
age,  which  is  generated  by  Airborne  Visible/Infrared  Imaging 
Spectrometer  (AVIRIS).  Every  pixel  of  the  Indian  Pine  consists 
of  220  bands  ranging  from  0.2  to  2.4/rni,  of  which  20  water 
absorption  bands  are  removed  before  classification.  The  spatial 
dimension  of  this  image  is  145  x  145.  The  image  contains 
16  ground-truth  classes,  most  of  which  are  crops,  as  shown 
in  Table  [n]  We  randomly  choose  997  pixels  (10.64%  of  all 
the  interested  pixels)  as  the  training  set  and  the  rest  of  the 
interested  pixels  for  testing. 

The  total  iterations  of  unsupervised  and  supervised  dictio¬ 
nary  learning  methods  are  set  to  15  and  200  respectively  for 
this  image.  The  classification  results  with  varying  dictionary 
size  N  are  shown  in  Fig.  [2]  In  most  cases,  the  classification 
performance  increases  with  the  increment  in  the  dictionary 
size.  All  methods  attain  their  highest  OA  when  the  dictionary 
size  is  10  atoms  per  class.  The  OA  of  ODL-JS,  ODL-LP, 
TDDL-JS  and  TDDL-LP  do  not  change  much  when  the  dic¬ 
tionary  size  increase  from  5  to  10  atoms  per  class.  Therefore, 
it  is  reasonable  to  set  the  dictionary  size  to  be  5  atoms  per 
class  by  taking  computational  cost  into  account.  Fig.  [2]  also 
suggests  that  a  plausible  performance  can  be  obtained  even 
when  the  dictionary  is  very  small  and  not  over-complete.  The 
classification  performance  with  respect  to  the  window  size  is 
demonstrated  in  Fig.  [3]  Using  a  window  size  of  5  x  5,  ODL-JS 
and  TDDL-JS  achieves  the  highest  OA  of  88.36%  and  92.65%, 
respectively.  When  the  window  size  is  set  to  7  x  7,  the  ODL- 
LP  and  TDDL-LP  reach  their  highest  OA  =  91.39%  and  OA 
=  94.20%,  respectively.  ODL-JS  and  TDDL-JS  reach  better 
performance  when  the  window  size  is  not  larger  than  5x5.  The 
TDDL-LP  outperforms  all  other  methods  when  the  window 
size  is  7  x  7  or  larger.  Since  a  larger  window  size  has  more 
chances  to  include  non-homogeneous  regions,  it  verifies  our 
argument  that  the  Laplacian  sparsity  prior  works  better  for 
classifying  pixels  lying  in  the  non-homogeneous  regions. 

Detailed  classification  results  of  various  methods  are  shown 
in  Table  [Til]  and  visually  displayed  in  Fig.  [4]  The  OA  of 
ODL-LP  reaches  91.39%,  which  is  more  than  20%  higher 
than  that  of  ODL  and  3%  higher  than  that  of  ODL-JS.  The 
TDDL-LP  has  the  highest  classification  accuracy  for  most 
classes.  Most  methods  have  0%  accuracy  for  class  9  since 
there  are  too  few  training  samples  in  this  class.  The  overall 
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(a)  SVM,  OA  =  64.94% 


(b)  SRC,  OA=  71.17% 


(d)  SRC-LP,  OA  =  79.40% 


(e)  ODL,  OA  =  71.04% 


(f)  ODL-JS,  OA  =  88 . 36%  (g)  ODL-LP,  OA  =  9 1 . 39%  (h)  TDDL,  OA  =  8 1 . 43%  (i)  TDDL-JS,  OA  =  92 . 65%  (j)  TDDL-LP,  OA  =  94 . 20% 

Fig.  4:  Classification  map  of  the  Indian  Pine  image  obtained  by  (a)  SVM,  (b)  SRC,  (c)  SRC-JS,  (d)  SRC-LP,  (e)  ODL,  (f) 
ODL-JS,  (g)  ODL-LP,  (h)  TDDL,  (i)  TDDL-JS  and  (j)  TDDL-LP 


TABLE  III:  Classification  accuracy  (%)  for  the  Indian  Pine  image 


Dictionary  Size 

N  =  997 

N  =  80 

Class 

SVM 

SRC 

SRC-JS 

SRC-LP 

ODL 

ODL-JS 

ODL-LP 

TDDL 

TDDL-JS 

TDDL-LP 

1 

77.08 

68.75 

79.17 

82.42 

75.00 

97.92 

70.83 

50.00 

35.42 

56.25 

2 

84.96 

58.84 

81.94 

81.34 

59.69 

91.24 

94.26 

84.03 

94.57 

93.95 

3 

62.67 

24.40 

56.67 

47.35 

62.93 

81.20 

84.40 

69.73 

84.13 

92.13 

4 

8.57 

49.52 

27.62 

49.76 

23.81 

47.62 

61.90 

14.76 

79.05 

46.19 

5 

77.18 

81.88 

85.46 

83.96 

82.55 

93.29 

92.62 

89.04 

90.16 

90.83 

6 

91.82 

96.88 

98.36 

97.48 

88.24 

99.55 

98.96 

98.66 

99.55 

98.96 

7 

13.04 

0.00 

0.00 

0.00 

4.35 

17.39 

0.00 

0.00 

0.00 

95.65 

8 

96.59 

96.59 

100.00 

99.55 

96.36 

99.32 

99.32 

99.09 

100.00 

100.00 

9 

0.00 

5.56 

0.00 

0.00 

0.00 

0.00 

0.00 

0.00 

0.00 

0.00 

10 

71.30 

24.00 

18.94 

31.89 

67.51 

77.73 

91.04 

72.90 

90.13 

94.03 

11 

35.25 

96.22 

91.63 

94.58 

67.94 

88.25 

94.10 

85.46 

96.22 

97.37 

12 

42.39 

32.97 

45.29 

64.68 

80.62 

88.59 

83.15 

59.06 

86.78 

95.47 

13 

91.05 

98.95 

99.47 

99.48 

95.79 

100.00 

100.00 

100.00 

100.00 

100.00 

14 

94.85 

98.97 

98.97 

99.49 

87.20 

97.77 

99.14 

98.11 

99.40 

99.40 

15 

30.70 

49.71 

55.85 

63.84 

32.16 

70.76 

67.84 

47.66 

77.78 

82.75 

16 

27.06 

88.24 

95.29 

97.65 

69.41 

96.47 

85.88 

92.94 

91.76 

98.82 

OA[%] 

64.94 

71.17 

76.41 

79.40 

71.04 

88.36 

91.39 

81.43 

92.65 

94.20 

AA[%] 

56.53 

60.72 

64.67 

64.67 

62.10 

77.94 

82.18 

66.43 

76.56 

83.86 

K 

0.647 

0.695 

0.737 

0.712 

0.691 

0.851 

0.907 

0.8087 

0.924 

0.940 

performance  of  TDDL-JS  and  TDDL-LP  have  at  least  13% 
improvement  over  the  other  conventional  dictionary  learning 
techniques.  TDDL-LP  significantly  outperforms  other  methods 
on  the  classes  that  occupy  small  regions  in  the  image.  The  class 
7  (Grass/Pasture-mowed),  lying  in  a  non-homogeneous  region, 
has  only  3  training  samples  and  23  test  samples.  The  TDDL- 
LP  is  capable  of  correctly  classify  95.65%  test  samples  while 
the  second  highest  accuracy  is  only  17.39%.  We  notice  that 
the  AA  of  both  ODL-LP  (82.18%)  and  TDDL-LP  (83.86%) 
are  at  least  4%  higher  than  that  of  the  other  methods.  This 
also  suggests  that  the  Laplacian- sparsity-enforced  dictionary 
learning  methods  work  better  on  non-homogeneous  regions, 
since  the  AA  can  only  attain  high  value  when  both  the  most 
regions  reach  high  accuracy. 

C.  Classification  on  ROSIS  Pavia  Urban  Data  Set 

The  last  two  images  to  be  tested  are  the  University  of  Pavia 
and  the  Center  of  Pavia,  which  are  urban  images  acquired  by 
the  Reflective  Optics  System  Imaging  Spectrometer  (ROSIS). 
It  generates  115  spectral  bands  ranging  from  0.43  to  0.86/im. 

The  University  of  Pavia  image  contains  610  x  340  pixels. 


Fig.  5:  (a)  Training  sets  and  (b)  test  sets  of  the  University  of 
Pavia  image. 


12  noisiest  bands  out  of  all  115  bands  are  removed.  There  are 


nine  ground- truth  classes  of  interests  as  shown  in  Table  IV 


For  this  image,  the  training  samples  were  manually  labelled 
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(f)  ODL-JS,  OA  =  75 . 83%  (g)  ODL-LP,  OA  =  78 . 15% 


(h)  TDDL,  OA  =  69.30% 


(i)  TDDL-JS,  OA  =  84.48% 


(j)  TDDL-LP,  OA  =  85.70% 


Fig.  6:  Classification  map  of  the  University  of  Pavia  image  obtained  by  (a)  SVM,  (b)  SRC,  (c)  SRC-JS,  (d)  SRC-LP,  (e) 
ODL,  (f)  ODL-JS,  (g)  ODL-LP,  (h)  TDDL,  (i)  TDDL-JS  and  (j)  TDDL-LP 


TABLE  V:  Classification  accuracy  (%)  for  the  University  of  Pavia  image 


Dictionary  Size 

N  =  3921 

N  =  45 

Class 

SVM 

SRC 

SRC-JS 

SRC-LP 

ODL 

ODL-JS 

ODL-LP 

TDDL 

TDDL-JS 

TDDL-LP 

1 

84.55 

57.11 

77.04 

95.08 

39.16 

86.64 

79.38 

74.60 

79.27 

87.77 

2 

82.45 

58.22 

67.98 

66.70 

66.37 

56.48 

75.89 

51.27 

86.85 

78.89 

3 

77.08 

57.33 

44.32 

77.55 

65.40 

80.72 

62.42 

77.19 

71.13 

78.79 

4 

94.19 

95.94 

95.13 

95.19 

78.67 

99.04 

96.91 

98.08 

98.87 

98.21 

5 

99.01 

100.00 

99.85 

100.00 

99.91 

100.00 

99.82 

99.91 

99.91 

99.91 

6 

23.55 

89.60 

88.31 

96.60 

64.94 

96.89 

72.13 

90.07 

68.74 

91.64 

7 

2.06 

83.27 

96.59 

96.59 

91.64 

91.23 

84.10 

86.14 

68.09 

93.17 

8 

33.89 

48.65 

65.20 

67.36 

67.36 

90.81 

75.98 

78.00 

95.54 

94.20 

9 

53.05 

93.69 

99.59 

99.59 

71.07 

98.37 

93.46 

95.72 

91.82 

95.09 

OA[%] 

69.84 

66.51 

74.05 

80.82 

64.57 

75.83 

78.15 

69.30 

84.48 

85.70 

AA[%] 

61.09 

75.98 

80.06 

88.80 

71.66 

88.91 

82.23 

83.44 

84.47 

90.85 

K 

0.569 

0.628 

0.681 

0.758 

0.549 

0.731 

0.747 

0.662 

0.817 

0.835 

by  an  analyst.  The  total  number  of  training  and  testing 
samples  is  3,921  (10.64%  of  all  the  interested  pixels)  and 
40, 002  respectively.  The  training  and  testing  map  are  visually 
displayed  in  Fig.  [5] 

For  the  University  of  Pavia,  we  set  the  total  iterations  of 
unsupervised  and  supervised  dictionary  learning  methods  to 
be  30  and  200  respectively.  The  window  size  is  set  to  5  x  5 
for  all  joint  or  Laplacian  sparse  regularized  methods  to  obtain 
the  highest  OA.  The  ODL-LP  is  able  to  reach  a  performance 
of  78.15%  for  OA,  which  is  more  than  14%  higher  than  that  of 
ODL.  The  ODL-JS  also  significantly  improves  the  OA,  which 


is  more  than  11%  higher  than  that  of  ODL.  TDDL-LP  has 
the  highest  OA  =  85.70%,  which  indicates  that  it  outperforms 
other  methods  when  classify  large  regions  of  the  image.  It  also 
has  the  highest  hi  =  0.935.  The  best  classification  accuracy 
for  class  1  (Asphalt),  which  consists  of  narrow  strips,  is 
obtained  by  using  TDDL-LP  (87.77%).  Class  2  (Meadows) 
is  composed  of  large  smooth  regions,  as  expected,  TDDL- 
JS  gives  the  highest  accuracy  (86.85%)  for  this  class.  TDDL 
has  large  amount  of  misclassification  pixels  for  class  2.  The 
highest  A  A  (90.85%)  is  given  by  TDDL-LP,  which  confirms 
that  the  TDDL-LP  is  superior  to  other  methods  when  classify 
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TABLE  IV:  Number  of  training  and  test  samples  for  the 
University  of  Pavia  image 


Class  # 

Name 

Train 

Test 

1 

Asphalt 

548 

6304 

2 

Meadows 

540 

18146 

3 

Gravel 

392 

1815 

4 

Trees 

524 

2912 

5 

Metal  sheets 

265 

1113 

6 

Bare  soil 

532 

4572 

7 

Bitumen 

375 

981 

8 

Bricks 

514 

3364 

9 

Shadows 

231 

795 

Total 

3921 

40002 

Fig.  7:  (a)  Training  sets  and  (b)  test  sets  of  the  Center  of  Pavia 
image. 


TABLE  VI:  Number  of  training  and  test  samples  for  the  Center 
of  Pavia  image 


Class  # 

Name 

Train 

Test 

1 

Water 

745 

64533 

2 

Trees 

785 

5722 

3 

Meadows 

797 

2094 

4 

Bricks 

485 

1667 

5 

Soil 

820 

5729 

6 

Asphalt 

678 

6847 

7 

Bitumen 

808 

6479 

8 

Tile 

223 

2899 

9 

Shadows 

195 

1970 

Total 

5536 

97940 

the  pixels  in  non-homogeneous  regions. 

The  third  image  where  we  evaluate  various  approaches  is 
the  Center  of  Pavia,  which  consists  of  1094  x  492  pixels. 
Each  pixel  has  102  bands  after  removing  13  noisy  bands. 
This  image  consists  of  nine  ground-truth  classes  of  interest 
as  shown  in  Table  [VI]  and  Fig.  [7]  5,  536  manually  labelled 
pixels  are  designated  as  the  training  samples  and  the  remaining 
97, 940  interested  pixels  are  used  for  testing. 

Since  this  image  has  more  labeled  samples  than  the  other 
two  images,  we  set  the  total  iterations  of  unsupervised  and 
supervised  dictionary  learning  methods  to  be  75  and  1000 
respectively.  The  window  size  is  set  to  5  x  5  for  the  joint 
sparse  and  Laplacian  sparse  regularized  methods.  Although 


the  OA  of  most  methods  are  close,  the  OA  of  ODL-JS  and 
ODL-LP  are  still  around  3%  higher  than  that  of  ODL.  The 
TDDL-LP  reach  the  highest  OA  =  98.67%  over  all  the  other 
methods.  The  OA  of  TDDL-JS  (98.01%)  is  slightly  lower 
than  that  of  the  TDDL-LP.  We  notice  that  SRC-JS  (OA  = 
98.01%)  and  SRC-LP  (OA=98.36%)  also  render  competitive 
performance  when  compared  to  TDDL-JS  and  TDDL-LP  due 
to  the  fact  that  the  raw  spectral  features  of  this  image  is  already 
highly  discriminative.  TDDL-LP  outperforms  other  methods 
on  almost  all  classes  and  works  especially  well  for  Class 
4  (Bricks),  achieving  highest  accuracy  of  97.41%.  Except 
for  SRC-LP  where  the  accuracy  is  94.72%,  none  of  others 
reaches  accuracy  over  90%  for  Class  4.  Additionally,  the  AA 
of  TDDL-LP  (97.21%)  is  almost  2%  better  than  that  of  TDDL- 
JS  (95.68%).  These  results  support  our  assertion  that  the 
Laplacian  sparsity  prior  provides  stronger  discriminability  on 
nonhomogeneous  regions.  Performance  comparison  between 
the  SRC-based  and  TDDL-based  methods  have  shown  that 
the  dictionary  size  can  be  drastically  decreased  by  applying 
supervised  dictionary  learning  while  achieving  even  better 
performance. 


VI.  CONCLUSION 

In  this  paper,  we  proposed  novel  a  task  driven  dictionary 
learning  method  with  joint  or  Laplacian  sparsity  prior  for  HSI 
classification.  The  corresponding  optimization  algorithms  are 
developed  using  fixed  point  differentiation,  and  are  further 
simplified  for  ease  of  implementation.  We  also  derived  the 
optimization  algorithm  for  solving  the  Laplacian  sparse  recov¬ 
ery  problem  using  SpaRS  A,  which  improves  the  computational 
efficiency  due  to  the  availability  of  a  more  accurate  descent 
direction.  The  performance  and  the  behavior  of  the  proposed 
methods,  i.e.  TDDL-JS  and  TDDL-LP,  have  been  extensively 
studied  on  the  popular  hyperspectral  images.  The  results 
confirm  that  both  TDDL-JS  and  TDDL-LP  give  plausible 
results  on  smooth  homogeneous  regions,  while  TDDL-LP  one 
works  better  for  classifying  small  narrow  regions.  Compared 
to  TDDL-JS,  TDDL-LP  is  able  to  obtain  a  more  stable 
performance  by  describing  the  similarities  of  neighboring 
pixels’  sparse  codes  more  delicately.  The  results  also  confirm 
that  a  significantly  better  performance  can  still  be  achieved 
when  joint  or  Laplacian  prior  is  imposed  by  using  a  very 
small  dictionary.  The  overall  accuracy  of  our  algorithm  can  be 
improved  by  applying  kernelization  to  the  proposed  approach. 
This  can  be  achieved  by  kernelizing  the  sparse  representation 
]44|  and  using  a  composite  kernel  classifier  |45|. 


VII.  Appendix  A 

We  can  infer  from  Eq.  (j~j~8])  that  vec  ( dA/dDmn )  =  0,  Vn  G 
Ac,  which  indicates  dC/dD mn  =  0,  Vn  G  Ac.  Therefore, 
we  only  need  to  take  the  gradient  dC/dDmn ,  Vn  G  A  into 
account. 

From  the  Eq.  ([13])  and  Eq.  ([18]),  we  achieve  the  gradient  for 
every  element  of  D, 


dC  (dC 

— ~ -  =  vec  — — 

dDmn  \dA 


T 


•  vec 


dA 

r)f) 

UJ-ymn 


(32) 
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(f)  ODL-JS,  OA  =  96.13%  (g)  ODL-LP,  OA  =  97.86%  (h)  TDDL,  OA  =  96.30%  (i)  TDDL-JS,  OA  =  98.01%  (j)  TDDL-LP,  OA  =  98.67% 


Fig.  8:  Classification  map  of  the  Center  of  Pavia  image  obtained  by  (a)  SVM,  (b)  SRC,  (c)  SRC-JS,  (d)  SRC-LP,  (e)  ODL, 
(f)  ODL-JS,  (g)  ODL-LP,  (h)  TDDL,  (i)  TDDL-JS  and  (j)  TDDL-LP 


where  m  =  1, . . M  and  n  =  1, . . . ,  N\.  Let  g  = 

and  W  =  WA  is 


vec 

the  A 


($r)  =  vec  f  (wA  —  y)  T  W 

A  columns  of  W.  Expand  Eq.  (fl8| 


Eq.  ([32]),  we  have 

DC 


.  Expand  Eq.  ©  and  combine  it  with 


0Dn 


—  Umn  Vm 


,  dC 

and  — —  =  U  —  V, 
<9D 


(33) 


where  U,V  E  MmxJVA  and  every  element  Umn ,  Vmn  are 
defined  as 


Umn  =  gT  (dtD  ®  Ip  +  Ar)  1  vec  ((X  -  DA)t  E ron)  , 
Vmn  =  gT  (dtD  0  Ip  +  Ar)  _1  vec  (atE^d)  , 


Consider  the  simplification  for  U  first 

Umn  =  gT  (dtD  0  Ip  +  Ar) _1  (ETn  0  Ip)  vec  ((X  -  DA) 
=  gTF”i;ec  ((X  -  DA)T)  _  ,  (34) 

where  F  =  ^DTD  0  Ip  +  Ar)  ;  m(m)  =  {(m  —  1)P  + 

1, . . . ,  mP},  h(n)  =  {{n  —  1)P+1, . . . ,  nP}  denote  the  index 
sets;  Fn  are  the  n  columns  of  F. 

Let  =  vec  ( (X  —  DA)T  j  .  It  can  be  shown  that 

V  /  fri 

is  the  mih  row  of  (X  -  DA).  Now  the  (m,  n)  element  f/mn 
of  the  first  part  U  can  be  written  as 


where  Emn  e  MMx7Va  is  the  indicator  matrix  that  element 

(m,  n)  of  E mn  is  1  and  all  other  elements  are  zero.  Umn  —  (gTF)  Cm 5 


(35) 
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TABLE  VII:  Classification  accuracy  (%)  for  the  Center  of  Pavia  image 


Dictionary  Size 

iV  =  5536 

N 

=  45 

Class 

SVM 

SRC 

SRC-JS 

SRC-LP 

ODL 

ODL-JS 

ODL-LP 

TDDL 

TDDL-JS 

TDDL-LP 

1 

96.97 

99.58 

99.52 

99.28 

96.26 

99.13 

99.69 

98.54 

98.76 

99.13 

2 

91.09 

90.07 

96.89 

92.11 

84.25 

94.63 

90.63 

89.55 

97.59 

93.01 

3 

96.08 

95.42 

99.47 

98.62 

93.36 

96.23 

97.61 

95.18 

96.85 

98.71 

4 

86.32 

79.96 

78.28 

94.72 

61.61 

64.73 

97.30 

85.78 

85.18 

97.41 

5 

88.57 

93.70 

97.05 

97.14 

89.40 

84.62 

90.00 

88.08 

98.25 

99.59 

6 

95.27 

95.62 

98.19 

97.18 

94.35 

95.03 

94.49 

94.39 

99.36 

99.18 

7 

94.03 

93.86 

97.01 

96.84 

86.31 

86.90 

97.33 

91.65 

94.46 

98.66 

8 

99.83 

99.17 

99.66 

99.66 

96.76 

99.79 

99.00 

98.17 

99.38 

99.73 

9 

85.74 

98.58 

99.19 

99.95 

93.25 

90.56 

94.42 

95.53 

91.27 

95.61 

OA[%] 

95.68 

97.57 

98.01 

98.36 

93.67 

96.13 

97.86 

96.30 

98.01 

98.67 

AA[%] 

93.77 

94.00 

95.03 

97.28 

88.39 

90.18 

95.61 

92.99 

95.68 

97.89 

K 

0.923 

0.961 

0.965 

0.971 

0.899 

0.938 

0.965 

0.940 

0.968 

0.979 

Stacking  all  elements  of  U 


u  = 


-(gTF)^M 


(etf)Na^ 

(gTF 


M- 


dC  \  1  (  dA 

dAj,  'VeC[dDn 


(40) 


A  \  UJ-^mn  J  a 

Expand  Eq.  ([3TJ  and  combine  it  with  Eq.  ([40]),  the  desired 
gradient  is 


=  £  (gTF)iT...(gTF)^ 


(36) 


where  An  denotes  the  nth  element  of  set  A. 

Now  consider  simplification  for  V.  Each  element  Vmn  of 
V  can  be  written  as 


where 


Vmn  =  gTF  •  vec  (ATE^nD j 

=  gTF  ^DTEmn  0  IP)  vec  (AT) 
=  gTF  (D^0lp)  Al, 


9C  _  TT  T i  9C  _  TT  ^ 

07-1  Unin  Vmn  tind  U  V, 

vDmn 


Umn  =  gTF-W  (E^  (X  -  DA))a  , 
vmn  =  gTF_1vec  (DTEmnA)A  , 


(41) 


F  =  (ip  0  DtD  +  7L  0  Ijv 


-1 

*A,A  ' 


(37) 


Let  g  has  the  same  definition  as  that  in  Section  VII  The  first 

df 


IS 


where  An  is  the  nth  row  of  A  and  Dm  is  the  mth  row  of  D. 
Stacking  every  element  of  V,  such  that 


part  u  of 

Umn=grFvec{Eln  (X  —  DA))a 
=  (gTF)flWC(X-DA)-( 


(42) 


V  = 


gTF  (l)J  0IF) 


gTF  (£>],  0lp) 
rEiiD3-n(gTF)BAT 


LXLiPL  (gTF)flAT 
D[(gTF)^...(gTF)T 


gTF  (pj  0  Ip) 
gTF  (f»T  0lp) 


EtiD7„(gTF)ftA 


t  n 

N 


E mn  C  RMxiV  is  the  indicator  matrix  that  the  (m,  n)  element 
is  1  and  all  other  elements  are  zero,  m  and  n  are  defined  as 
the  following  index  sets, 

m(m,  n)  =  {m, . . . ,  m  +  pM, . . .  Vp  s.t.  n  +  pN  G  A 
n(n)  =  {n,  n  +  A, . . . ,  n  +  (P  —  1)N}  D  A 


TtLl^Mn  (gTF)ftATj 

(38) 


Eq.  (42)  can  be  further  simplified  by  introducing  h(n)  G  Mp, 
such  that 


where  p(p)  =  {p,p  +  P, . . .  ,p  +  (N  —  1)P}.  Combining  Eq. 
and  Eq.  pS]) 


h(n)  =  f  (gTF)n+pJV  ,  if  n  +  pN  £  n(n),Vp 

1  0,  otherwise 


|g  =  u  -  V  =  -  D/3aAt  and  =  tf3T  -  D/3AT 

(39) 

where  f3Ac  =  0  and  (3A  =  [(gTF)^  ,  •  •  •  ,  (gTF)^]T.  More 

generally,  we  have  defined  /3A  G  RNaxP  such  that  vec(fi\)  = 

Fg. 

VIII.  Appendix  B 

The  gradient  for  updating  the  dictionary  can  be  written  as 

dA 


Now  Eq.  ([42])  can  be  rewritten  as, 

TT  — 

umn  —  11  s  m) 


(43) 


(44) 


where  £  '  is  the  mth  row  of  X  —  DA.  The  first  part  U  of  the 

df_ 

6>D 

r^hd) 


gradient  Mr  can  be  obtained  by  stacking  all  Umn  in  Eq.  d44 ) 


U  = 


dC  fdC 

dDmn  -  vec  Ua 


0D„ 


=  € 


=  £/3r, 


,&h(1)  ••• 

h«-hW 


x^hW 


(45) 
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where  we  define  (3  =  ,  h^]7"  G  RNxP.  By 

examining  the  nonzero  elements  position  of  h^1), . . . ,  h^, 
it  is  not  difficult  to  find  the  relation  between  (3  and  gTF 


vec  (f3)A  =  Fg  and  vec  {/3)Ac  =  0.  (46) 

Now  consider  the  second  term  Vmn  of 

^  mn 

Vmn  =  (gTF)  (ip  <8>  DTEmn)A  A  vec(A)A 

=  (gTF)  (  [Dmvec(A)n  . . .  D  mvec( A)(n+(p_1)j/v)]  )  A 

p 

=  Dmy^An,p/3p,  (47) 

P=  1 


where  (3P  is  the  pth  column  of  (3.  The  differentiation  can 
be  derived  from  Vmn  in  Eq.  © 


V  = 


Di 


Ep=i  AllP/3p,  •  •  •  ,  Ep=i  A.n,pF 


Dm  [Ep=i  Ai)P  (Fg)p  •  •  •  E;=1  A n>p  (Fg)^ 

"  P  P 

_p= i  p= i 

=  D/3At, 


=  D 


(48) 


Combining  Eq.  <|45ji  and  Eq.  <(48ji,  we  reach  the  gradient  of 
the  dictionary 


dC_ 

<9D 


=  £/3t  —  D/3At. 


(49) 
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